Prioritized experience replay was proposed by T. Schaul et. al., and is widely used to speed up reinforcement learning (as far as I know).

Roughly speaking, mis-predicted observations will be learned more frequently. To compensate distorted probability, weight of learning is scaled to the opposite direction (cf. importance sampling).

In cpprb, `PrioritizedReplayBuffer`

class implements these functionalities with proportional base (instead of rank base) priorities.

\(i\)-th | Definition |
---|---|

Probability: \( P(i) \) | \( \frac{(P_{i}+\epsilon)^{\alpha}}{\sum _{j=0}^{N} (P_{j}+\epsilon)^{\alpha}} \) |

Weight: \( w_i \) | \( \left( \frac{1}{N} \frac{1}{P(i)}\right) ^{\beta} \) |

You can `add`

`priorities`

together with other environment. If no `priorities`

are passed, the stored maximum priority is used.

The `dict`

returned by `sample`

also has special key-values of `indexes`

and `weights`

. The `indexes`

are intended to be passed to `update_priorities`

to update their priorities after comparison with new prediction.

`PrioritizedReplayBuffer`

has hyperparameters `alpha`

and `eps`

at constructor and `beta`

at `sample`

, and their default values are `0.6`

, `1e-4`

, and `0.4`

, respectively. The detail is described in the original paper above.

Parameters | Default | Description |
---|---|---|

\( \alpha \) | \( 0.6 \) | Prioritized parameter. `0` means uniform. |

\( \beta \) | \( 0.4 \) | Compensation parameter. `1` means fully compensate (i.e. Importance Sampling) |

\( \epsilon \) | \( 10^{-4} \) | Small value to avoid `0` priority |

```
import numpy as np
from cpprb import PrioritizedReplayBuffer
buffer_size = 256
prb = PrioritizedReplayBuffer(buffer_size,
{"obs": {"shape": (4,4)},
"act": {"shape": 3},
"rew": {},
"next_obs": {"shape": (4,4)},
"done": {}},
alpha=0.5)
for i in range(1000):
prb.add(obs=np.zeros((4,4)),
act=np.ones(3),
rew=0.5,
next_obs=np.zeros((4,4)),
done=0)
batch_size = 32
s = prb.sample(batch_size,beta=0.5)
indexes = s["indexes"]
weights = s["weights"]
# train
# ...
new_priorities = np.ones_like(weights)
prb.update_priorities(indexes,new_priorities)
```

To choose prioritized sample efficiently, partial summation and minimum of pre-calculated weights are stored in Segment Tree data structure, which is written by C++ and which was an initial main motivation of this project.

To support multiprocessing, the Segment Tree can be lazily updated, too.